Red Wine Quality Exploration by Sheng Weng

Univariate Plots Section

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

## [1] 1319   13
## [1] 63 13
## [1] 217  13

The quality distribution looks roughly normal: about 82.5% of the red wines (1319 of 1599) are rated 5 or 6. Only 10 wines are quality 3 and 18 wines are quality 8. My initial thought is that the variables with a strong impact on wine quality should also be normally distributed.
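The group shares can be checked from the row counts printed in the dim() output above. This is a minimal sketch in base R; the variable names are illustrative and not taken from the original code.

```r
# Row counts copied from the dim() output above.
n_all    <- 1599  # all red wines
n_medium <- 1319  # quality 5 and 6
n_bad    <- 63    # quality 3 and 4
n_good   <- 217   # quality 7 and 8

# The three groups should partition the data set.
stopifnot(n_bad + n_medium + n_good == n_all)

# Share of each group, in percent, rounded to one decimal place.
share <- round(100 * c(bad = n_bad, medium = n_medium, good = n_good) / n_all, 1)
print(share)
```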

I start by taking a first look at the histograms of all the variables.

The fixed.acidity histogram peaks around 7 and is right skewed, so I log transform it.

It looks like the histogram of volatile.acidity has two peaks at 0.4 and 0.7, and it’s right skewed. I’m going to use a log transform.
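The plotting code is not shown in this report, so the following is only a sketch of a log transform in base R. It assumes log10() and hist(), and uses the ten volatile.acidity values from the str() output as a stand-in for the full column.

```r
# First ten volatile.acidity values from the str() output above (illustration only).
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# Log-transform the values, then bin them without drawing the plot.
log_va <- log10(volatile_acidity)
h <- hist(log_va, breaks = 5, plot = FALSE)

# Bin edges (on the log10 scale) and the number of wines per bin.
print(h$breaks)
print(h$counts)
```

Dropping plot = FALSE draws the histogram. Because log10(0) is -Inf, a variable that contains zeros, such as citric.acid, loses those values on a log scale.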

There are some gaps at low x-axis values in the transformed figure.

For citric.acid, there is a strikingly high bar at zero and another one at 0.5.

The residual.sugar histogram looks like a normal distribution with a long right tail. Most wines have residual sugar less than 4. I also log transform it.

The chlorides histogram looks similar to the residual.sugar one, with a high peak at around 0.08.

The histograms of free.sulfur.dioxide and total.sulfur.dioxide look alike. Both are strongly right skewed, with high counts at low sulfur dioxide levels.

Density and pH both look close to normally distributed. Most wines have a density around 0.997 and a pH around 3.3.

Sulphates looks normally distributed with a long right tail. The peak is around 0.7. I log transform it.

There is a high peak at an alcohol level around 9.5, and the distribution is right skewed. I log transform it.

I am going to exclusively look at good wines with quality 7 and 8, trying to figure out if they have some common characteristics.

##        X          fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   8.0   Min.   : 4.900   Min.   :0.1200   Min.   :0.0000  
##  1st Qu.: 482.0   1st Qu.: 7.400   1st Qu.:0.3000   1st Qu.:0.3000  
##  Median : 939.0   Median : 8.700   Median :0.3700   Median :0.4000  
##  Mean   : 831.7   Mean   : 8.847   Mean   :0.4055   Mean   :0.3765  
##  3rd Qu.:1089.0   3rd Qu.:10.100   3rd Qu.:0.4900   3rd Qu.:0.4900  
##  Max.   :1585.0   Max.   :15.600   Max.   :0.9150   Max.   :0.7600  
##  residual.sugar    chlorides       free.sulfur.dioxide
##  Min.   :1.200   Min.   :0.01200   Min.   : 3.00      
##  1st Qu.:2.000   1st Qu.:0.06200   1st Qu.: 6.00      
##  Median :2.300   Median :0.07300   Median :11.00      
##  Mean   :2.709   Mean   :0.07591   Mean   :13.98      
##  3rd Qu.:2.700   3rd Qu.:0.08500   3rd Qu.:18.00      
##  Max.   :8.900   Max.   :0.35800   Max.   :54.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  7.00       Min.   :0.9906   Min.   :2.880   Min.   :0.3900  
##  1st Qu.: 17.00       1st Qu.:0.9947   1st Qu.:3.200   1st Qu.:0.6500  
##  Median : 27.00       Median :0.9957   Median :3.270   Median :0.7400  
##  Mean   : 34.89       Mean   :0.9960   Mean   :3.289   Mean   :0.7435  
##  3rd Qu.: 43.00       3rd Qu.:0.9973   3rd Qu.:3.380   3rd Qu.:0.8200  
##  Max.   :289.00       Max.   :1.0032   Max.   :3.780   Max.   :1.3600  
##     alcohol         quality     
##  Min.   : 9.20   Min.   :7.000  
##  1st Qu.:10.80   1st Qu.:7.000  
##  Median :11.60   Median :7.000  
##  Mean   :11.52   Mean   :7.083  
##  3rd Qu.:12.20   3rd Qu.:7.000  
##  Max.   :14.00   Max.   :8.000

I compared the summary of all the wines and the summary of the good wines. I calculated how much the mean value for each variable has changed. Based on the results, I divided the 11 variables into four groups:
1. Strong change (>20%): volatile.acidity, citric.acid, total.sulfur.dioxide.
2. Medium change (10% - 13%): chlorides, free.sulfur.dioxide, sulphates, alcohol.
3. Small change (~6%): fixed.acidity, residual.sugar.
4. Tiny change (<1%): density, pH.
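The percentage changes behind this grouping can be recomputed from the Mean rows of the two summaries above. This is a sketch in base R that uses the rounded means as printed, so the results are approximate; the vector names are illustrative.

```r
# Mean values copied from the summary() output for all wines.
mean_all <- c(fixed.acidity = 8.32, volatile.acidity = 0.5278, citric.acid = 0.271,
              residual.sugar = 2.539, chlorides = 0.08747, free.sulfur.dioxide = 15.87,
              total.sulfur.dioxide = 46.47, density = 0.9967, pH = 3.311,
              sulphates = 0.6581, alcohol = 10.42)

# Mean values copied from the summary() output for good wines (quality 7 and 8).
mean_good <- c(fixed.acidity = 8.847, volatile.acidity = 0.4055, citric.acid = 0.3765,
               residual.sugar = 2.709, chlorides = 0.07591, free.sulfur.dioxide = 13.98,
               total.sulfur.dioxide = 34.89, density = 0.9960, pH = 3.289,
               sulphates = 0.7435, alcohol = 11.52)

# Relative change of the good-wine mean against the overall mean, in percent.
pct_change_good <- 100 * (mean_good - mean_all) / mean_all

# Largest absolute changes first.
print(round(sort(abs(pct_change_good), decreasing = TRUE), 1))
```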

Similarly, I want to create a group of wines with quality 3 & 4, and try to investigate how the variables change when the quality goes down.

##        X          fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :  19.0   Min.   : 4.600   Min.   :0.2300   Min.   :0.0000  
##  1st Qu.: 435.0   1st Qu.: 6.800   1st Qu.:0.5650   1st Qu.:0.0200  
##  Median : 834.0   Median : 7.500   Median :0.6800   Median :0.0800  
##  Mean   : 837.7   Mean   : 7.871   Mean   :0.7242   Mean   :0.1737  
##  3rd Qu.:1285.5   3rd Qu.: 8.400   3rd Qu.:0.8825   3rd Qu.:0.2700  
##  Max.   :1522.0   Max.   :12.500   Max.   :1.5800   Max.   :1.0000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 1.200   Min.   :0.04500   Min.   : 3.00      
##  1st Qu.: 1.900   1st Qu.:0.06850   1st Qu.: 5.00      
##  Median : 2.100   Median :0.08000   Median : 9.00      
##  Mean   : 2.685   Mean   :0.09573   Mean   :12.06      
##  3rd Qu.: 2.950   3rd Qu.:0.09450   3rd Qu.:15.50      
##  Max.   :12.900   Max.   :0.61000   Max.   :41.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  7.00       Min.   :0.9934   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 13.50       1st Qu.:0.9957   1st Qu.:3.300   1st Qu.:0.4950  
##  Median : 26.00       Median :0.9966   Median :3.380   Median :0.5600  
##  Mean   : 34.44       Mean   :0.9967   Mean   :3.384   Mean   :0.5922  
##  3rd Qu.: 48.00       3rd Qu.:0.9977   3rd Qu.:3.500   3rd Qu.:0.6000  
##  Max.   :119.00       Max.   :1.0010   Max.   :3.900   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.60   1st Qu.:4.000  
##  Median :10.00   Median :4.000  
##  Mean   :10.22   Mean   :3.841  
##  3rd Qu.:11.00   3rd Qu.:4.000  
##  Max.   :13.10   Max.   :4.000

If a variable has a strong impact on wine quality, I expect its mean to move in opposite directions for good and bad wines relative to all wines. Based on this criterion, I tentatively regroup the 11 variables:

1. Strong impact: volatile.acidity, citric.acid.
2. Medium impact: chlorides, sulphates.
3. Small impact: fixed.acidity, free.sulfur.dioxide, alcohol.
4. Tiny impact: residual.sugar, total.sulfur.dioxide, density, pH.
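The "inverse change" criterion can be checked mechanically from the three Mean rows above. The sketch below, in base R, flags a variable when its good-wine and bad-wine means move in opposite directions relative to all wines. The column names are illustrative, and the inputs are the rounded means as printed.

```r
# Mean values copied from the three summary() outputs above.
means <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid", "residual.sugar",
               "chlorides", "free.sulfur.dioxide", "total.sulfur.dioxide",
               "density", "pH", "sulphates", "alcohol"),
  all  = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87, 46.47, 0.9967, 3.311, 0.6581, 10.42),
  good = c(8.847, 0.4055, 0.3765, 2.709, 0.07591, 13.98, 34.89, 0.9960, 3.289, 0.7435, 11.52),
  bad  = c(7.871, 0.7242, 0.1737, 2.685, 0.09573, 12.06, 34.44, 0.9967, 3.384, 0.5922, 10.22)
)

# Percentage change of each group mean relative to all wines.
means$good_pct <- 100 * (means$good - means$all) / means$all
means$bad_pct  <- 100 * (means$bad  - means$all) / means$all

# TRUE when the good and bad groups move in opposite directions.
means$inverse <- sign(means$good_pct) * sign(means$bad_pct) < 0

print(means[, c("variable", "good_pct", "bad_pct", "inverse")], digits = 3)
```

With these rounded means, the flag is FALSE for residual.sugar, free.sulfur.dioxide, total.sulfur.dioxide, and density, and TRUE for the other seven variables. The sign check ignores the size of the change, so it complements the grouping above rather than replacing it.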

Surprisingly, the mean value of total.sulfur.dioxide for both good and bad wines drops by about 25% (24.9% and 25.9%, respectively) compared to that of all wines. So we cannot rely on this parameter to decide the wine quality.

Although the above grouping is based solely on changes in the mean value, it suggests that volatile.acidity and citric.acid are strongly related to wine quality.

Let’s compare the histogram of these two variables in all-wine group, good-wine group, and bad-wine group.

Most good wines have volatile acidity lower than 0.8, while the bad wines tend to have more widely spread and scattered volatile acidity values.
As for citric acid, many good wines have values between 0.3 and 0.7, but only a few bad wines fall in this range.

I’m interested to see how much the fixed acidity accounts for the total acidity. I assume the total acidity can be calculated as the sum of fixed acidity and volatile acidity. So I create a new variable named “fixed.acidity.percent”, which is calculated by: fixed.acidity / (fixed.acidity + volatile.acidity)
I also created a pH.bucket variable to divide pH into five groups.
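The two derived variables can be sketched in base R as follows. The data frame holds only the first ten rows from the str() output. The name `pf` follows the cor.test output later in the report, and the pH break points are an assumption because the report does not state them.

```r
# First ten rows of three columns, copied from the str() output above.
pf <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Share of fixed acidity in (fixed + volatile) acidity, as defined in the text.
pf$fixed.acidity.percent <- pf$fixed.acidity / (pf$fixed.acidity + pf$volatile.acidity)

# Five pH groups. These break points are hypothetical; they only need to
# cover the observed pH range of 2.74 to 4.01.
ph_breaks <- c(2.7, 3.0, 3.2, 3.4, 3.6, 4.1)
pf$pH.bucket <- cut(pf$pH, breaks = ph_breaks)

print(pf)
```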

Univariate Analysis

What is the structure of your dataset?

There are 1599 wines in the dataset with 11 attributes that may have impact on the wine quality. All the variables are numbers. There is no NA in this dataset.
1319 out of 1599 red wines are rated as 5 and 6.
The histograms of density and pH are close to normal distribution.
There is a high peak at citric.acid equal to zero.
The histograms of free.sulfur.dioxide and total.sulfur.dioxide have similar distribution, suggesting that these two variables may have strong correlation.

What is/are the main feature(s) of interest in your dataset?

I suspect that volatile.acidity and citric.acid are the two major features that determine the quality of wine. The mean value of volatile acidity for good wine is 0.4055, for bad wine is 0.7242. The median value of citric acid for good wine is 0.4, while for bad wine it’s only 0.08. Some other variables might have minor impact on the wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Chlorides, sulphates, fixed.acidity, free.sulfur.dioxide, and alcohol might have medium or small impact on the quality of wine.

Did you create any new variables from existing variables in the dataset?

I created a new variable named “fixed.acidity.percent” because I’m interested to see how much the fixed acidity accounts for the total acidity, which may have influence on the wine quality.
I created quality.bucket variable to divide the wine into three groups based on their quality. I also created a pH.bucket variable to divide pH into five groups.
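A quality.bucket variable of this kind could be built with cut(). In this sketch the group boundaries follow the report (3 and 4, 5 and 6, 7 and 8), while the labels "bad", "medium", and "good" are an assumption, since the original labels are not shown.

```r
# First ten quality values from the str() output above (illustration only).
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Intervals are (2,4], (4,6], (6,8], i.e. quality 3-4, 5-6, and 7-8.
quality.bucket <- cut(quality,
                      breaks = c(2, 4, 6, 8),
                      labels = c("bad", "medium", "good"))

print(table(quality.bucket))
```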

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I noticed that the histogram of volatile.acidity seems to have two distinct peaks. So I log-transformed it to make these two peaks more clear. It looks like there’s one peak around 0.4 and another peak around 0.7. These two peaks correspond well with the mean values of the volatile.acidity for good and bad wine groups. The mean value of volatile acidity for good wine is 0.4055, for bad wine is 0.7242.

Bivariate Plots Section

From the above scatter matrices, it turns out that the correlation coefficients of quality with volatile.acidity, citric.acid, sulphates, and alcohol are higher than those of the other variables.
What also interests me are the following pairs of variables that have strong correlation (> 0.5):
1. free.sulfur.dioxide vs. total.sulfur.dioxide
2. fixed.acidity vs. density, pH

Next, I want to look at the boxplots involving quality and other variables.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$volatile.acidity and pf$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

Good wines have a mean volatile acidity of about 0.4. The correlation between volatile acidity and quality is -0.391.
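The t statistic and the confidence interval in the output above follow from the sample correlation and the sample size. As a sketch, they can be reproduced in base R with the standard formulas for a Pearson correlation test: the t statistic is r * sqrt(df / (1 - r^2)), and the interval comes from the Fisher z transform. The variable names are illustrative.

```r
r  <- -0.3905578  # sample correlation from the output above
df <- 1597        # degrees of freedom, n - 2
n  <- df + 2      # number of wines

# t statistic for testing whether the true correlation is zero.
t_stat <- r * sqrt(df / (1 - r^2))

# 95% confidence interval via the Fisher z transform.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(t_stat)
print(ci)
```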

## 
##  Pearson's product-moment correlation
## 
## data:  pf$citric.acid and pf$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

The mean citric.acid value for bad wines is lower than 0.2, while for good wines it’s higher than 0.3. The correlation between citric acid and quality is 0.226.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$sulphates and pf$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

Good wines have higher mean sulphates values than bad wines, although the difference is not that big. The correlation between these two variables is 0.251.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$alcohol and pf$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

Although quality 5 wines have a lower mean alcohol value than quality 4 wines, the good wines have a much higher mean alcohol value than bad wines. The correlation between alcohol and quality is 0.476.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$fixed.acidity.percent and pf$quality
## t = 14.784, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3030968 0.3893614
## sample estimates:
##       cor 
## 0.3469627

Quality 7 wines have the highest mean fixed.acidity.percent.

Based on the above boxplots, volatile.acidity and citric.acid play important roles in determining the wine quality. Also, I will mainly focus on “sulphates” and “alcohol” among the medium and small impact factors that I mentioned in the Univariate Analysis.

Next, I want to see the scatter plot of free.sulfur.dioxide vs total.sulfur.dioxide.

Although the relationship does not look linear, all the points seem to be confined to a cone-shaped region.

I also take a look at the relationships between density and pH, alcohol and volatile acidity, and sulphates and alcohol. Most of the wines seem to have pH ~ 3.3 and density ~ 0.996.

Most of the red wines have an alcohol level lower than 11 and volatile acidity between about 0.4 and 0.65 (the interquartile range in the summary above is 0.39 to 0.64).

Sulphates mostly falls between 0.4 and 0.8.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

According to the boxplots, volatile.acidity generally decreases as the quality goes up, and citric.acid increases as the quality goes up.
The correlation between volatile acidity and quality is -0.391. The correlation between citric acid and quality is 0.226.
Additionally, good wines usually have higher sulphates and alcohol levels. The correlation between sulphates and quality is 0.251. The correlation between alcohol and quality is 0.476.
From the scatterplot of alcohol vs sulphates, I noticed that the majority of the red wines have alcohol level lower than 11 (% by volume), and sulphates between 0.4 and 0.8 g/dm^3.
The variation of volatile acidity looks bigger than that of sulphates.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Fixed.acidity.percent increases as the quality goes up.
Most of the wines seem to have pH ~ 3.3 and density ~ 0.996, according to the scatter plot of density and pH.

What was the strongest relationship you found?

Among the original variables' correlations with quality computed above, alcohol is the strongest (0.476), followed by volatile.acidity (-0.391), sulphates (0.251), and citric.acid (0.226). The two features I singled out in the univariate analysis, volatile.acidity and citric.acid, both vary with quality in the expected direction.

Multivariate Plots Section

Quality 7 wines have volatile acidity around 0.4 and citric acid around 0.35. Wines of quality 5 & 6 cover quite a large range of citric acid values.

I compare the volatile acidity and citric acid for good and bad wines exclusively. I notice that most good wines have volatile acidity from 0.25 to 0.5, and citric acid from 0.2 to 0.6. The bad-wine points are more scattered, but they tend to have volatile acidity above 0.5 and citric acid below 0.3.

For good wines the sulphates value is mostly from 0.6 to 0.9, the alcohol value from 10 to 13. For bad wines, on the other hand, the sulphates value is mostly from 0.25 to 0.65, and the alcohol value from 9 to 11.5.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$pH and pf$quality
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139

We can see different pH buckets gathered at quality 5, 6, 7, and 8. There’s no clear relationship between pH and quality, which indicates that pH is not a good variable to divide good and bad wines.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The plots support the earlier finding that higher citric.acid and lower volatile.acidity are associated with better wines. Also, better wines tend to have higher sulphates and alcohol content.

Were there any interesting or surprising interactions between features?

From the fourth plot, it turns out that pH has very little impact on wine quality, although the distribution of pH is also normal. The correlation between pH and quality is only -0.0577. Although the test above reports p = 0.021, a correlation this small is too weak to be useful for separating good and bad wines.
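To put the pH result in perspective, the p-value can be recovered from the t statistic, and the squared correlation gives the share of variance in quality that a linear fit on pH would account for. This is a sketch in base R using the figures from the cor.test output above; the variable names are illustrative.

```r
t_stat <- -2.3109      # t statistic from the output above
df     <- 1597         # degrees of freedom
r      <- -0.05773139  # sample correlation

# Two-sided p-value for the t statistic.
p_value <- 2 * pt(-abs(t_stat), df)

# Share of variance in quality accounted for by a linear fit on pH.
r_squared <- r^2

print(signif(p_value, 4))
print(round(100 * r_squared, 2))  # in percent
```

The squared correlation is well under 1%, which is why pH is of little practical use here even though the test rejects a correlation of exactly zero at the 5% level.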


Final Plots and Summary

Plot One

Description One

I first plot the histogram of wine quality, because this is the main variable that I’m interested in investigating. I want to know what features may change the wine quality. The quality distribution looks roughly normal: about 82.5% of the red wines are rated 5 or 6. 63 wines are of quality 3 & 4, and 217 wines are of quality 7 & 8. So we can define the bad-wine group as quality lower than 5, and the good-wine group as quality higher than 6.

Plot Two

Description Two

Citric acid is one of the major variables that I suspect to have a strong impact on wine quality. So I use a box plot to see the median citric acid value and the quartiles for each wine quality. I notice that better wines tend to have higher citric acid values. The median citric acid value increases as wine quality improves. It’s between 0 and 0.2 for bad wines, 0.2 ~ 0.4 for quality 5 & 6, and equal to or over 0.4 for good wines. What’s more, good wines have smaller citric acid variation. This result supports the idea that citric acid is a major contributor to wine quality.

Plot Three

Description Three

Sulphates and alcohol are the other two features that I suspect influence wine quality a lot. So I plot alcohol versus sulphates and use different colors to represent different wine quality. I look exclusively at good wines (quality 7 & 8, blue) and bad wines (quality 3 & 4, brown) in order to make the trend clearer. Most of the blue and dark blue points are gathered in the right corner of this plot; these points have sulphates values from 0.7 to 1.25 and alcohol values from 10 to 14. This indicates that better wines usually have higher alcohol and sulphates levels.


Reflection

Through this exploratory data analysis, I identified the key features that appear to determine red wine quality. I learned that we must look not only at univariate plots but also at plots of two or more variables to carefully inspect different possibilities. For example, the normally distributed pH gave me the impression that it might affect wine quality a lot, since the histogram of quality is also normal. However, after looking at the boxplot of pH vs quality, it turned out that pH does not have a strong correlation with wine quality. Therefore, we need to verify our ideas through in-depth research.

I improved my EDA skills a lot through this study. I learned that better analysis can be generated by removing the extreme outliers in the data. I learned that giving clear statistics along with proper plots can enhance the analysis. I also learned that detailed description is an important part of EDA. I spent a lot of time adding more comments and extending my discussion on the plots so that my ideas can be better conveyed through the reports.

The analysis suggests that four factors are mainly involved in determining quality: citric.acid, volatile.acidity, alcohol, and sulphates. It is important to note, however, that wine quality is subjective and may vary, as different wine experts may have different tastes. It would be better to know the background of these wine experts, as experts from France and India may have different standards for evaluating wine quality. Also, as we see from the histogram of wine quality, it is not a perfect normal distribution. It would be a great help if the experts could give a more precise scale, for example, 3, 3.5, 4, … , 7, 7.5, 8. That way, this data set might generate more convincing results.